This module provides an introduction to the R statistical package. On completion of this module you should:
Donegan K, King B and Bryan P. Safety of pertussis vaccination in pregnant women in UK: observational study. BMJ 2014;349:g4219 doi: 10.1136/bmj.g4219 (Published 11 July 2014).
Gregoire AJ, Kumar R, Everitt B, Henderson AF, Studd JW. Transdermal oestrogen for treatment of severe postnatal depression. Lancet. 1996 Apr 6;347(9006):930-3.
Rabe-Hesketh S and Everitt B. A Handbook of Statistical Analyses Using R; Fourth Edition. 2007, Chapman & Hall/CRC.
Spanos A, Harrell FE, Durack DT. Differential diagnosis of acute meningitis: An analysis of the predictive value of initial observations. JAMA, 1989; 262: 2700-2707.
R is a very powerful, and relatively easy to use Statistical package which allows you to undertake data management and manipulation, perform a range of basic and complex statistical tests and analyses, and generate a wide variety of different graphs. For this course, you will be required to use R to undertake basic analysis and graphics.
R is a dialect of the S language. It is a case-sensitive, interpreted language. You can enter commands one at a time at the command prompt (>) or run a set of commands from a source file. There is a wide variety of data types, including vectors (numerical, character, logical), matrices, data frames, and lists. Most functionality is provided through built-in and user-created functions and all data objects are kept in memory during an interactive session. Basic functions are available by default. Other functions are contained in packages that can be attached to a current session as needed.
R packages are bundles of functions which extend the capabilities of R. Thousands of add-on packages are available in the main online repository (CRAN) and several more packages can be found on GitHub. These packages can be found and installed online.
Stata and R have minor difference in default setting and methods. In this document, we will follow the Stata analysis as closely as possible, but small and usually important difference in Stata and R will be noted. Unlike Stata, R can hold multiple datasets in memory simultaneously and so, there is no need to save intermediate files or close and re-open datasets.
You need to communicate with R in a way that the program understands; i.e. you need to be able to tell R what you want in a way that it understands. You can do this by typing in commands using an appropriate “language”.
Syntax, also sometimes referred to as “code” or “programming”, refers to the grammatical rules that you need to follow so that R understands what you want it to do. Syntax does not necessarily follow rules of English grammar! You tell R to do something by using the correct command or set of commands. An R command is a combination of syntax elements, expressed following the rules of ‘R language’.
R Studio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, and debugging and workspace management. It is available in open source and commercial editions on the desktop (Windows, Mac, and Linux) and from a web browser to a Linux server running R Studio Server or R Studio Server Pro. In this course, we will be using R Studio. The R Studio environment is similar to the Stata environment and we hope will help ease the transition from Stata to R.
Ensure you have the latest Java installed. Download and install R and RStudio
Open the R program using RStudio by double-clicking on the icon (as you would open any other program that you are familiar with).
The diagram below shows the open RStudio program; you will see that there are various different components or windows. The R windows are:
R Script– where you type in R commands one by one (visible once you open or create a new R scropt file)
Console window – this shows a log of everything you do in R session, including commands and results of these commands
History – keeps a record of every command you enter in R session
Environment–stored all the datasets, arrays, lists and variables created in R
Figures – Displays all plots and figures created in R. You can use the arrows to move between images. The export option can be used to save the plots as pdf or image, or to copy plots to clipboard.
Packages – you can select the packages to be installed from the list of packages
The first dataset we will be using has been modified from a dataset for 581 patients with acute meningitis (Table 1). Data from a subset of 422 patients were used to examine factors associated with differentiating between acute viral or acute bacterial meningitis1. The original dataset can be found at: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/Cabm.html
The dataset is available as meningitis.csv dataset: A comma delimited Excel file which can be imported into R or other statistical programs; this type of file has a .csv extension.
Table1: Format of meningitis dataset
Create a new R project under “File > New Project…” and select “New Directory > New Project” (noting file path so that you know where to save .csv files etc. on your computer).
The project file name you just created will end in .Rproj. Locate the folder containing this file. This will be your working directory. This method is recommended each time you are working on a new project.
Save all .csv files to the folder containing your .Rproj file (you may need to convert some data files to .csv files).
To import data from a .csv file: This can be done by saving your excel or R file as a .csv file and importing in directly into R
meningitis <- read.csv("meningitis.csv")
Menu driven method:
You can import data in another format, using the File Menu. These include, csv, excel, SPSS, SAS and Stata files. Click on the File > Import Dataset, and specify the type of file you want to open – for .csv file, specify “From Text (readr)”.
Use the “browse” icon to specify the directory and dataset to be opened.
You can preview the first 50 observation in the dataset and the corresponding code to open the dataset.
Note: the code displayed varies from the one mentioned above. This is an example of the flexibility R provides the user with. There are multiple ways to do the same thing in R.
The file has the variable names in the first row – you therefore you need to click on this box.
When you specify the required dataset, R will show you part of the dataset, as shown above. You can see that the first row of the file includes the names of the variables (this will generally be the case for most datasets that you will be working with), and you need to check the box First row as variable names.
Open the dataset using the read.csv command or File Import Menu. You can play around with opening the different
types of datasets.
What do you see in each of the windows when you open the dataset?
R Studio has a nice feature where everything is in one browser window, so you can browse your dataset and your code without having to switch between windows.
#to view your dataset
View(meningitis)
Alternatively, you can also view your dataset by clicking on “meningitis” in on the top right panel in R studio browser (in the “Environment” panel), as shown below:
How many variables are there? How many observations?
What do you think that the NA values mean? (More on this later!)
The help() function and ? help operator in R provide access to the documentation pages for R functions, data sets, and other objects, both for packages in the standard R distribution and for contributed packages. Example: To access documentation for the standard lm (linear model) function, for example, enter the command help(lm) or help("lm"), or ?lm in R console. Another way is shown below:
1. Type the word summary(dataset) in the console and press enter.
What has happened? How many variables are there? How many observations?
2. Type summary(sex) into the command bar.
How is this result different to the one above?
Variables in R can contain numbers or strings (characters which can be letters or numbers, or sometimes other special characters). String variables can be up to 244 characters long. Numeric variables are stored either as integers (bytes, integers or longs) or floating point (float or double). The variable type is listed under the second column in the str() command output. You can check the length of the character variables using the nchar() command.
It is important to be familiar with different types of variables and the way they are stored in R. Some analyses require that data be stored as numeric rather than string variables. This means for example that sex is better stored as numbers (e.g. 0 or 1) rather than string variables (ie “female” or “male”).
While storing variable values as numbers is important for analysis, we need to be know what each of these values represent; for example if we code sex as 0 or 1, which value represents males and which females? We need to have a data dictionary, which “translates” the numeric values in to character labels. You can label the numeric values of a variable in R so that the words rather than the numeric values are displayed; the numeric values will be retained for analysis purposes. We will discuss this later.
Type dataset_name in the console and press enter. What does this show?
Type dataset_name$var1 then press enter. What does this show?
Tabulating data:
#attach the dataset before analysis
attach(meningitis_csv_dataset)
table (sex)
# or you can avoid the 'attach(dataset)' by using the '$' operator to specify which dataset the variable comes from:
table(meningitis$sex)
What does the results window show? How many males and females are in the data set?
tab <- table(meningitis$sex, meningitis$abm)
tab
margin.table(tab, 1) # A frequencies (summed over B)
margin.table(tab, 2) # B frequencies (summed over A)
prop.table(tab) # Cell percentages
prop.table(tab,1) # Row percentages
We have previously generated the table of acute bacterial meningitis by sex, with row percentages (ie the percentage of males and females who had acute bacterial meningitis). Now use R commands to generate column percentage;
i.e. the % of those with and without acute bacterial meningitis who were female.
One method of obtaining descriptive statistics is to use the summary command and follow it with the sapply( ) function with a specified summary statistic (e.g. SD).
sapply(meningitis$wbc, sd, na.rm=TRUE)
Other possible functions used in sapply include mean, var, min, max, median, range, and quantile.
Alternatively, you can also use describe() command available in library(psych) - after first downloading the package under Packages tab > install, in the bottom right hand RStudio window.
What is the minimum, maximum, mean, standard deviation and median value for blood glucose?
library(psych)
describe(meningitis$bloodgl)
The order command allows you to arrange the observations in the dataset in increasing or decreasing order of a particular variable, or set of variables. For example you may want to sort the observations in order of patient id number
The default method of sorting is in ascending order. To sort in decreasing order a minus sign (ie “-“) must be placed in front of the variable to be sorted. To preserve the original dataset, we create two new datasets (m1) and (m2) where:
m1 is the dataset where we have sorted the original dataset in ascending order of variable “casenum”.
m2 is the dataset where we have sorted the original dataset in descending order of variable “casenum”.
You can confirm this by viewing both datasets, as shown below.
m1 <- meningitis[order(meningitis$casenum),]
head(m1, n=10) # selects first 10 rows for you to view - observe the order of the variable "casenum"
## casenum year month age race sex wbc bloodgl gl pr whites polys
## 1 1 78 1 4.00000000 1 2 6.5 165 3 304 4000 100
## 2 2 78 12 1.00000000 1 1 3.7 150 92 NA 5490 97
## 3 3 78 3 0.79999995 1 1 NA 183 36 NA 4500 100
## 4 4 78 8 54.00000000 1 1 7.5 NA 52 43 0 NA
## 5 5 NA NA NA NA NA NA NA NA NA NA NA
## 6 6 NA NA NA NA NA NA NA NA NA NA NA
## 7 7 78 12 0.50000000 1 1 NA NA NA NA 16500 100
## 8 8 78 2 0.09999996 1 1 NA 95 54 NA 505 100
## 9 9 NA NA NA NA NA NA NA NA NA NA NA
## 10 10 78 12 11.00000000 2 2 NA NA 35 175 700 1
## gram abm
## 1 4 1
## 2 5 1
## 3 0 NA
## 4 0 1
## 5 .z 0
## 6 .z 0
## 7 .z 1
## 8 0 1
## 9 .z 0
## 10 0 NA
m2 <- meningitis[order(-meningitis$casenum),]
head(m2, n=10)
## casenum year month age race sex wbc bloodgl gl pr whites polys
## 581 581 70 7 13.0 2 1 12.500000 122 35 176 8145 82
## 580 580 70 7 1.0 1 2 11.599999 NA 73 NA 240 50
## 579 579 70 8 8.0 1 1 15.099999 140 95 NA 69 88
## 578 578 70 7 23.0 1 1 10.599999 NA 84 32 242 4
## 577 577 70 7 2.0 1 2 4.900000 NA 85 18 100 90
## 576 576 70 7 14.0 1 2 12.099999 110 60 37 121 80
## 575 575 70 7 0.6 1 1 10.500000 NA NA 36 36 6
## 574 574 70 11 24.0 2 1 7.299999 116 67 NA 34 62
## 573 573 70 6 22.0 2 1 12.599999 143 54 NA 7900 93
## 572 572 72 7 25.0 2 1 8.000000 150 86 NA 193 1
## gram abm
## 581 2 1
## 580 0 NA
## 579 .z 0
## 578 .z 0
## 577 0 1
## 576 .z 0
## 575 .z 0
## 574 .z 0
## 573 0 NA
## 572 .z 0
What has now happened to the observations?
Multiple variables can be used to sort the observations. For example, a dataset includes individuals with more than one record, such as baseline data and data for two follow-up visits, we could sort by patient and id and then visit number (or date). We can sort one (or more) variable in increasing order and another (or more) in decreasing order. When we list more than one variable for sorting, observations are sorted by the first variable and then for each value of the first variable, observations are ordered by the second variable, and so on. You can sort by any type of variable – continuous, categorical, and binary or string.
Sort the observations by increasing year of admission and then by increasing age, and look at the data. Now sort by increasing year of admission and decreasing age and see what has happened. Sort by CSF glucose within sex. How has R dealt with missing values?
Some procedures require that the data be sorted by particular variable/s. The menus will often undertake a data sort for many procedures requiring sorted data, however using R code / commands may still require you to sort the data before performing a particular procedure or operation; If that is the case R will give you an error message specifying that the data have not been sorted.
R doesn’t use the if and by command as STATA to repeat commands across groups. It is done differently as shown below.
To produce the separate tables of acute bacterial meningitis for males and females, we can tabulate abm with sex as we did above:
table(meningitis$abm, meningitis$sex)
##
## 1 2
## 0 118 85
## 1 113 104
# to make it easier to interpret, we can add table labels:
table(abm=meningitis$abm,sex=meningitis$sex)
## sex
## abm 1 2
## 0 118 85
## 1 113 104
Does acute bacterial meningitis appear to be higher for males or females? Does R account for missing values?
Table ignores missing values. To include NA as a category in counts, include the table option exclude=NULL if the variable is a vector. If the variable is a factor you have to create a new factor using newfactor <- factor(oldfactor, exclude=NULL).
If we are interested in obtaining summary statistics on white blood count for females, we can use the summarize command (without and with the detail option):
summary(meningitis$wbc[meningitis$sex==1])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.70 8.30 11.80 13.33 16.20 53.50 119
Note that we have to specify two equal signs: == not =. Double equal signs are always required for such commands. This is one of the syntax rules of R and the command won’t run if this is not specified correctly.
You can also use the above command to specify a range of values to be included in the analysis:
table(meningitis$abm[meningitis$wbc>=5 & meningitis$wbc<=10])
##
## 0 1
## 94 32
What is the mean CSF protein for participants aged between 20 and 70?
What proportion of white females had acute bacterial meningitis?
summary(meningitis$pr[meningitis$age>=20 & meningitis$age<=70])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 53.0 96.0 207.4 209.8 1740.0 114
t <- table(meningitis$sex, meningitis$abm, meningitis$race==2)
prop.table(t,1)*100
## , , = FALSE
##
##
## 0 1
## 1 30.13100 27.07424
## 2 26.59574 33.51064
##
## , , = TRUE
##
##
## 0 1
## 1 20.52402 22.27074
## 2 18.61702 21.27660
In R, its is easier to keep the variables than drop them. There is no keep command, but we can specify the variables you want to retain in the dataset. It is always better to create a new dataset with the fewer variables than modify the existing dataset.
There are 3 major ways to do this in R
myvars <- c("v1", "v2", "v3")
newdata <- mydata[myvars]
myvars <- paste("v", 1:3, sep="")
newdata <- mydata[myvars]gc
newdata <- mydata[c(1,5:10)]
Example: you can subset the meningitis dataset as:
Meningitis1 <- meningitis[c(1:11,13:14)]
To drop or exclude variables, you can use the following syntax:
myvars <- names(mydata) %in% c("v1", "v2", "v3")
newdata <- mydata[!myvars]
If we have a very large dataset and want to retain only a small number of variables then the keep statement would be more efficient.
For example if want a dataset which only includes meningitis patients aged less than 20 we could use either of the following statements:
newdata <- meningitis[ which(meningitis$age <20 ),]
Graphs are a very useful way of summarising and describing data, and enable you to look at the shape of the distribution. R has very good graphics packages including ggplot2.
We will first generate a histogram for white blood count (WBC).
# The below code generates a histogram
hist(meningitis$wbc, freq = FALSE, breaks = 17, xlim = c(0,80)) # freq FALSE will plot density on y-axis instead of frequency and xlim sets the range on the x axis to be from 0 to 80, with 17 breaks (bins)
# The below code generates the same histogram but without the breaks specification and x-axis limitation
hist(meningitis$wbc)
What do you think about the distribution for white blood count?
Generate a histogram for blood glucose. What do you think about the shape of the distribution? Regenerate the
histogram but add the breaks option and specify 20. What does this do?
Also note what is the default number of bins specified when we use the `hist()` command alone
If we want to see how close to a normal distribution our histogram is we can overlay a normal curve.
To do this, we first calculate mean and standard deviation for bloodgl.
x <- na.omit(meningitis$bloodgl)
h <- hist(x)
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
Generate a graph of CSF glucose with 25 bins and a normal curve overlay. What does this graph show? Does it look reasonable? What might you interpret from this graph?
x <- na.omit(meningitis$gl)
h <- hist(x, breaks = 25, xlim = c(0,250))
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="red", lwd=2)
Using the command line rather than the graphics menu, generate a graph of CSF glucose with a normal curve overlay for: a) males only; b) for participants aged between 15 and 75.
#a)
x <- subset(meningitis, sex==2) # subsetting dataset by males only
x <- na.omit(meningitis$gl)
h <- hist(x, breaks = 25, xlim = c(0,250) , main = "Histogram of glucose CSF \n for males", xlab = " glucose CSF") # main heading and x-axis titles have been added here
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
#b)
x1 <- na.omit(meningitis$gl[meningitis$age >=15 & meningitis$age<=75])
h <- hist(x1, breaks = 11, xlim = c(0,200), main = "Histogram of glucose CSF", xlab = " glucose CSF")
xfit<-seq(min(x1),max(x1),length=40)
yfit<-dnorm(xfit,mean=mean(x1),sd=sd(x1))
yfit <- yfit*diff(h$mids[1:2])*length(x1)
lines(xfit, yfit, col="blue", lwd=2)
Histograms display variables one at a time. A scatterplot will show the relationship between two continuous variables. We will obtain a scatterplot of heart rate versus mean blood pressure.
#pch command will give sold circles to mark points in a graph
plot(meningitis$gl, meningitis$bloodgl, pch = 19, col = "darkblue")
There are a very large range of options in R for specifying the look of the graph, including titles, labelling and defining axes, the colour and shape of the markers (i.e. the points on the graph), legend specification and location, subsetting, putting multiple plots on the same graph, etc. An example of R code to produce a graph of CSF versus blood glucose, separately for males and females is shown below with the corresponding graph.
plot(meningitis$bloodgl, meningitis$gl, pch = c(21, 19), col= c("blue", "red")[unclass(meningitis$sex)], main = "Graph of CSF glucose vs blood glucose by sex")
#Add legend
legend(1,195, legend = c("Male", "Female"), col = c("Red","Blue"), pch = c(21, 19), cex = 0.8)
Play around with generation of scatter plots of white blood count versus total leukocytes separately for patients with and without acute bacterial meningitis. Try and change the colour and type of markers, axes, title and legends.
It is very important to retain a copy of your results so that you, or anyone can check this, and to be able to re-run your analysis at any time. In addition you will often want to, add to, repeat or revise your analyses. You do not want to have to type in your commands or re-do all of the menu analysis again, thus you also should save a copy of all of your commands.
Graphs are not saved in the results window, or the log file, and need to be individually saved.
To do this click on Export and save as either Image or pdf. You can also copy the plot onto your clipboard.
This then opens up a black page, something like a Word document, which you can put all of your commands. This is a good strategy, especially when you first start using R. Alternatively, you can also open a new R script using Ctrl+ Shift+ N.
You can add comments on a line after a hashtag (#) as shown in the R script example below.
In computer programming, a comment is a programmer-readable explanation in the source code of a computer program. They are added with the purpose of making the source code easier to understand, and are generally ignored by compilers and interpreters.
The setwd command at the start of the file tells R to change your working directory to the one specified. This means that you don’t have to specify the directory every time you open or save a file.
Have a play around with R, using both the icons and command line. Tabulate each of the variables individually;
e.g. year. Do all of the values seem reasonable? Are there any unusual or extreme values? Generate some two way
tables; e.g. race abm, and look at the relationship between these variables. See what different types of graphs
you can generate using the commands.
This section describes how to generate confidence intervals and undertake hypothesis tests using R.
You can obtain confidence intervals and undertake hypotheses testing using the R package in two ways:
As for many types of analyses there is more than one command which provide the same/similar results. This section only focuses on how to undertake hypothesis tests in R, and does not consider the appropriateness of the test, nor assumptions checks.
We want to obtain the 95% confidence interval for mean birthweight for a sample with the following features:
o 20 observations
o Sample mean of 3159 grams
o Sample standard deviation of 220 grams
Unlike Stata, we need to construct the 95% CI for mean. The 95% confidence interval for the population mean can be expressed as:
\[\overline{X}\pm1.96\sigma/\sqrt{n}\]
## Descriptive statistics
# 95% CI for mean
n <- 20
xbar <- 3159
sd <- 220
# calculate standard error
se <- sd/(sqrt(n))
E <- 1.96*se
# or alternatively, E <- qnorm(0.975)*se
ci<- xbar + c (-E, E)
ci
## [1] 3062.581 3255.419
The 95% CI is (3063, 3255). We are 95% sure that the true (population) mean lies between 3063 and 3255; if we were to undertake this study 100 times, 95 of the CIs would include the true mean.
If we would like a higher level of precision we might want to obtain 99% confidence intervals:
## Descriptive statistics
# 99% CI for mean
n <- 20
xbar <- 3159
sd <- 220
# calculate standard error
se <- sd/(sqrt(n))
E <- qnorm(0.995)*se
ci<- xbar + c (-E, E)
ci
## [1] 3032.286 3285.714
The 99% CI is (3032, 3286). What is the interpretation of this interval?
Table 3 shows the birthweight summary statistics by an observational study on the safety of pertussis vaccination in pregnant women in the UK 2.
Calculate the 95% confidence interval for mean birthweight for babies of mothers:
a) who were vaccine; and
b) who were not vaccinated
Table1: Birthweight summary statistics
We want to estimate the proportion (%) of Australian adults who have diabetes with 95% CI. Of a sample of 11247 Australian adults, 844 have diabetes. What is the proportion and 95% CI?
Like the 95% CI for mean, we need to construct the 95% CI for proportion. The 95% confidence interval for the population proportion can be expressed as:
\[\hat{p}\pm z\sqrt\frac{\hat{p}(1-\hat{p})}{n}\]
Spanos A, Harrell FE, Durack DT. Differential diagnosis of acute meningitis: An analysis of the predictive value of initial observations. JAMA, 1989; 262: 2700-2707↩
Donegan K, King B and Bryan P. Safety of pertussis vaccination in pregnant women in UK: observational study. BMJ 2014;349:g4219 doi: 10.1136/bmj.g4219 (Published 11 July 2014↩